Add auto cv pipeline multi-buffer enable and preload support #620
zhangstevenunity wants to merge 13 commits into main
Conversation
PR #615 added the SyncCodegen scaffolding for multi-buffer set/wait_flag_dyn ops, but the upstream analysis path was inert: GetEventIdNum hard-coded `return 1`, and PTOIRTranslator dropped every pto.pointer_cast addr operand beyond the first. The N>1 multi-buffer code path was therefore unreachable. This change makes the PlanMemory -> InsertSync handshake actually work:
- PTOIRTranslator::UpdatePointerCastOpMemInfo now translates *all* N pto.pointer_cast i64 addr operands into BaseMemInfo.baseAddresses (constants -> bit offsets, non-constants -> hasVariableAddress=true). Subview/alias deltaOffset is applied to every slot, not just slot 0.
- MemoryDependentAnalyzer gains getMultiBufferSlotCount(a, b) implementing HIVM's per-slot geometry check: same-index slots must overlap (real back-edge dep on the same physical buffer) and different-index slots must NOT overlap (so consecutive iterations land in disjoint physical buffers).
- InsertSyncAnalysis::GetEventIdNum is rewritten to follow HIVM semantics: every dependent pair must be multi-buffer-eligible, all pairs must agree on N, and every involved buffer must hang off the same scf.for.
Verified end-to-end: a `pto.multi_buffer = 2` alloc inside scf.for now emits 2 reserved event ids, an `iv mod 2` arith.select chain, and `pto.{set,wait}_flag_dyn` ops driven by the selected idx. New regression guards this in test/lit/pto/multi_buffer_insert_sync_dyn_event_id.pto.
Co-Authored-By: Claude <noreply@anthropic.com>
Builds on P0 to make the multi-buffer pipeline robust under realistic pressure:
- SyncEventIdAllocation: HIVM-style TryFallbackMultiBuffer. When the event-id pool can't satisfy N, fall back to N=2 first (when N is even and >2), then to 1, instead of silently returning with an empty eventIds vec. Both set/wait siblings get their `eventIdNum` updated in lockstep so SyncCodegen takes the same code path on each side.
- PTOPlanMemory: add `StorageEntry::isMultiBufferSlot` flag and rewrite `UpdateBuffer2Offsets` to walk via primaries and emit slots in declared `relationOtherBuffers` order. This makes the slot-ordering contract that EnableMultiBuffer relies on (`offsets[i]` <-> iv mod N selector index `i`) explicit and verifies it via a runtime assertion.
- PTOEnableMultiBuffer: add a scope guard (skip non-VEC/MAT casts, where the iv-mod-N selector is not meaningful) and a loop-invariance guard (skip casts whose addrs are not loop-invariant; hoisting them above the for loop would break SSA dominance).
- New regression at test/lit/pto/multi_buffer_n4_insert_sync.pto: N=4 slot ordering [0, 1024, 2048, 3072] in pto.pointer_cast, 3-way arith.select chain, and 4 distinct event ids drained at function end.
Co-Authored-By: Claude <noreply@anthropic.com>
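The fallback ladder amounts to a few lines. A hedged sketch follows; the function name and the free-pool parameter are illustrative, not the real SyncEventIdAllocation API.

```cpp
#include <cassert>

// Pick the largest satisfiable multi-buffer degree: the requested N if the
// event-id pool has room, else 2 (only when N is even and > 2), else 1.
static int tryFallbackMultiBuffer(int requestedN, int freeEventIds) {
  if (freeEventIds >= requestedN)
    return requestedN;
  if (requestedN > 2 && requestedN % 2 == 0 && freeEventIds >= 2)
    return 2; // drop an even N to ping/pong before giving up on multi-buffer
  return 1;   // single-buffer: the plain set_flag/wait_flag path
}
```

Both the set and wait siblings would then be stamped with the same returned value so SyncCodegen stays in lockstep on each side.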
Final hardening pass tightening multi-buffer correctness around the edges
that PR-615 and earlier P0/P1 commits left ambiguous.
- InsertSyncAnalysis: GetEventIdNum now takes the back-edge scf.for and
requires every involved buffer's enclosing loop to *equal* that loop.
Previously a multi-buffer alloc nested inside an inner loop could
silently rotate slots on the wrong induction variable when a back-edge
on an outer loop carried the dep. Forward deps unconditionally collapse
to single-buffer.
- PTOEnableMultiBuffer: cache `iv mod N` per (loop, N) so several
multi-buffer pointer_casts sharing a loop reuse one counter, mirroring
SyncCodegen::loop2BufferCounter. Avoids redundant arith.remui +
constant ops in the loop body.
- PTOPlanMemory:
* `MemPlan::emitMultiBufferError` + `multiBufferDiagnosticEmitted_`
flag converts multi-buffer-specific `report_fatal_error` calls
(slot-order mismatch, pong outline placement failure) into
diagnostics that bubble through `plan()` as `failure()`. This lets
the existing PlanMemoryPass retry loop re-seed instead of aborting
the compiler under heavy multi-buffer memory pressure.
* `VerifyConflictStage1` now enumerates *every* historical
multi-buffer slot offset (via `CollectMultiRelationPongAnchors`)
as a SPEC_LEVEL_1 reuse anchor candidate. PR-615 only used the last
slot, silently dropping reuse for N > 2.
- New regression at test/lit/pto/multi_buffer_nested_loop.pto: nested
scf.for with multi-buffer alloc in the inner loop must rotate slots on
the inner iv, not the outer one.
Co-Authored-By: Claude <noreply@anthropic.com>
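The per-(loop, N) counter cache described above is essentially a memoized emitter. A minimal model, with `LoopId`/`ValueId` as stand-ins for the MLIR op and SSA value types:

```cpp
#include <cassert>
#include <map>
#include <utility>

using LoopId = int;  // stands in for the scf.for op
using ValueId = int; // stands in for the emitted `iv mod N` SSA value

struct BufferCounterCache {
  std::map<std::pair<LoopId, int>, ValueId> cache; // (loop, N) -> counter
  ValueId nextValue = 0;
  int remuiEmitted = 0; // counts materialized arith.remui chains

  // Return the cached `iv mod N` for (loop, n), emitting it at most once.
  ValueId getOrCreate(LoopId loop, int n) {
    auto key = std::make_pair(loop, n);
    auto it = cache.find(key);
    if (it != cache.end())
      return it->second;
    ++remuiEmitted; // real code would emit arith.constant N + arith.remui
    return cache[key] = nextValue++;
  }
};
```

Several pointer_casts sharing a (loop, N) pair thus reuse one counter instead of each materializing its own remui + constant.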
`VerifyConflictStage2`'s contract (per its own comment) is "block reuse only when buffers (a) share a parent loop AND (b) have a DMA pipe conflict". `PipeConflictInSameLoop` violated that on two counts:
1. After confirming same parent loop, it returned true unconditionally instead of querying `PipeConflict` for the actual DMA pipe-conflict relation. SPEC_LEVEL_2 was therefore semantically stricter than SPEC_LEVEL_3: any same-loop pair was rejected, even when no DMA pipe conflict existed.
2. `GetBufferParentLoop` returns nullptr for top-level buffers; two top-level buffers compared parentLoop1 == parentLoop2 == nullptr, so the "different loops -> allow reuse" early-return was bypassed and the function fell through to "return true". Almost every cross-buffer pair at function scope was getting marked as conflicting, blocking valid reuse and causing local-memory planning to fail/overflow in cases that previously fit.
Fix: reject the nullptr case explicitly, then fall through to the real `PipeConflict` query before declaring a conflict. 166/166 lit tests still pass; the change can only widen the set of allowed reuses, never narrow it.
Co-Authored-By: Claude <noreply@anthropic.com>
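The fixed predicate can be modeled directly. This is a sketch, not the pass code: parent loops are plain IDs with 0 standing in for nullptr (top-level), and the DMA pipe-conflict relation is passed in rather than queried.

```cpp
#include <cassert>

// SPEC_LEVEL_2 reuse blocker after the fix: block reuse only when both
// buffers share a real parent loop AND actually conflict on a DMA pipe.
static bool pipeConflictInSameLoop(int parentLoop1, int parentLoop2,
                                   bool hasDmaPipeConflict) {
  if (parentLoop1 == 0 || parentLoop2 == 0)
    return false; // top-level buffer(s): the nullptr case rejected explicitly
  if (parentLoop1 != parentLoop2)
    return false; // different loops -> allow reuse
  return hasDmaPipeConflict; // same loop: defer to the real PipeConflict query
}
```

The old code returned true on both the same-loop path (unconditionally) and, via fall-through, the double-nullptr path; the fix makes both branches answer "no conflict" unless the real pipe query says otherwise.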
After merging upstream/main, multi-buffer-enabled loops emitted spurious `pto.barrier <PIPE_MTE2>` and `pto.barrier <PIPE_MTE3>` ops in addition to the multi-buffer `set_flag_dyn` / `wait_flag_dyn` pair, e.g.:

```
pto.wait_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]
pto.barrier <PIPE_MTE2>    <- redundant
pto.tload ;                   (MTE2)
pto.set_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
pto.wait_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
pto.barrier <PIPE_MTE3>    <- redundant
pto.tstore ;                  (MTE3)
pto.set_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]
```

Both barriers came from `InsertSyncOperation`'s same-pipe back-edge branch unconditionally emitting `PIPE_BARRIER` for any cross-iteration dep on a single pipe. Two of these were truly redundant:
1. Same-pipe back-edge whose dep is multi-buffer eligible. The whole point of multi-buffer is that consecutive iterations land in disjoint physical slots, so the cross-iter "same-address" dep is fundamentally false; the multi-buffer dyn flag pair already does the cross-iter ordering. No barrier needed.
2. Same-pipe back-edge on a DMA pipe (MTE1/MTE2/MTE3/MTE4/MTE5). DMA pipes are simple in-order command queues; the hardware itself serializes consecutive ops on the same DMA pipe across iterations. PIPE_BARRIER on a DMA pipe is a no-op and just clutters the IR.
Fix: in `InsertSyncOperation`, compute the multi-buffer eventIdNum once (this also resolves the back-edge scf.for so the cross-pipe path can reuse it - a small refactor of the existing logic) and skip the barrier when either condition above holds. PIPE_M / PIPE_V keep the conservative PIPE_BARRIER to preserve the "bar_m" / "bar_v" intra-pipe semantics that higher-level frontends rely on (the existing comment in `IsNoNeedToInsertSync` calls this out).
168/168 lit pass, including all multi-buffer regressions and the upstream-merged comm/sync tests.
Co-Authored-By: Claude <noreply@anthropic.com>
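The two skip conditions fold into one small predicate. A sketch with an illustrative pipe enum (names follow the commit message, not necessarily the real headers):

```cpp
#include <cassert>

enum class Pipe { M, V, MTE1, MTE2, MTE3, MTE4, MTE5 };

static bool isDmaPipe(Pipe p) { return p >= Pipe::MTE1 && p <= Pipe::MTE5; }

// Same-pipe back-edge dep: decide whether PIPE_BARRIER is still needed.
static bool needsSamePipeBarrier(Pipe p, int multiBufferEventIdNum) {
  if (multiBufferEventIdNum > 1)
    return false; // disjoint slots per iteration: the same-address dep is false
  if (isDmaPipe(p))
    return false; // in-order DMA command queue: hardware already serializes
  return true;    // PIPE_M / PIPE_V keep the conservative barrier
}
```

Computing `multiBufferEventIdNum` once up front (as the fix does) lets both the barrier decision and the cross-pipe path share the same answer.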
…ehmann-807dab

# Conflicts:
#	lib/PTO/Transforms/CMakeLists.txt
#	tools/ptoas/ptoas.cpp
Wire multi-buffer support into the GraphSyncSolver path so it matches the
behaviour the InsertSync path got in earlier P0/P1/P2 commits. Mirrors the
intra-core subset of `bishengir/lib/Dialect/HIVM/Transforms/GraphSyncSolver`
(`SyncSolver::getEventIdInfo`, `getMultiBufferEventIdInfo`,
`getMultiBufferLoop`, `SyncSolverCodeGen::getMultiBufferSelectOp`).
Detection (Utility.h, SyncSolver.h/.cpp):
- `EventIdInfo` (eventIdNum + multibufferLoop) carried per ConflictPair.
- `getMultiBufferLoop` finds the common scf.for whose iv drives the slot
rotation; reuses the per-slot overlap helper added in P0
(MemoryDependentAnalyzer::getMultiBufferSlotCount), so the same
same-index-overlap / different-index-disjoint geometry rule applies.
- `getMultiBufferEventIdInfo` requires every dependent pair to agree on N
and share the same scf.for; failure -> single-buffer.
- `getEventIdInfo` is the top-level wrapper with the backward-only gate
(mirrors HIVM's "only optimise back-edge deps").
Solver wiring (handleSetWaitConflict):
- Threads rwOp1/rwOp2 in so MB info is computable.
- Allocates N event ids via the existing Welsh-Powell coloring node
(EventIdNode already supports eventIdNum > 1).
- Falls back to N=1, then PIPE_ALL barrier, if N ids cannot be coloured.
Codegen (SyncSolverCodeGen.h/.cpp):
- New `emitMultiBufferSetWait` emits the HIVM/InsertSync output shape:
pre-loop N pto.set_flag (queue prime), in-body iv-mod-N counter +
N-way arith.select chain + pto.{wait,set}_flag_dyn, post-loop N
pto.wait_flag (queue drain). The dyn ops live INSIDE the loop body so
they share the selector's dominance (GSS's default backward-sync
anchors are at the loop boundary, which works for single-buffer but
not for a per-iteration selector).
- `loop2BufferCounter_` cache reuses one `iv mod N` across multiple
ConflictPairs sharing a (loop, N) tuple.
Test:
- New `multi_buffer_gss_dyn_event_id.pto` is the GSS counterpart of the
existing InsertSync N=2 regression; checks pre-loop primes, the
selector chain, dyn set/wait, and post-loop drains.
191/191 lit pass.
Co-Authored-By: Claude <noreply@anthropic.com>
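The N-way arith.select chain the codegen emits resolves, for idx in [0, N), to eventIds[idx]. A scalar C++ model of that chain (the fold a reader can check by hand, not the MLIR builder code):

```cpp
#include <cassert>
#include <vector>

// Folds the emitted chain select(idx==0, id0, select(idx==1, id1, ...)):
// N event ids need N-1 arith.select ops, with id[N-1] as the final "else".
static int selectChain(const std::vector<int> &eventIds, int idx) {
  int result = eventIds.back();
  for (int i = static_cast<int>(eventIds.size()) - 2; i >= 0; --i)
    result = (idx == i) ? eventIds[i] : result; // one arith.select per level
  return result;
}
```

For N=2 this is a single select; for N=4 it is the 3-way chain the earlier regression checks for.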
Code Review
This pull request implements a CV Preload optimization framework for PTOAS, introducing new IR operations and passes to facilitate multi-buffering and software pipelining between Cube and Vector kernels. The implementation covers automatic multi-buffer marking, scope identification, and stage-guarded loop expansion, alongside significant updates to memory planning and synchronization solvers. Technical feedback highlights several critical areas for refinement: ensuring deterministic IR generation by avoiding DenseMap iteration, permitting pure operations like constants within loop bodies to align with the design specification, and addressing logic gaps in scope identification that currently miss leading or trailing operations. Furthermore, the reviewer recommends strengthening conflict analysis for top-level buffers to mitigate potential hardware hazards.
```cpp
SmallVector<LocalBufferBinding> orderedBindings;
orderedBindings.reserve(localBindings.size());
for (const auto &it : localBindings)
```
The iteration over localBindings (which is a DenseMap) is non-deterministic. This causes the order of cloned PointerCastOp and BindTileOp operations at the beginning of the expanded loop body to vary across different compiler runs. While the resulting code is functionally equivalent, non-deterministic IR generation makes debugging, testing, and IR diffing significantly more difficult.
Consider using a MapVector or maintaining a separate SmallVector to track the insertion order of bindings during the walk.
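A minimal insertion-order map shows why llvm::MapVector addresses the issue. This is a simplified stand-in, not the LLVM type itself: lookups stay O(1) through the index while iteration follows insertion order, so cloned ops come out in a stable order across runs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Insertion-ordered map: `entries` fixes iteration order, `index` gives
// hash-based lookup; iterating `entries` is deterministic by construction.
template <typename K, typename V>
struct OrderedMap {
  std::vector<std::pair<K, V>> entries;
  std::unordered_map<K, std::size_t> index;

  V &operator[](const K &k) {
    auto it = index.find(k);
    if (it != index.end())
      return entries[it->second].second;
    index.emplace(k, entries.size());
    entries.emplace_back(k, V{});
    return entries.back().second;
  }
};
```

Swapping the DenseMap for such a structure (or a parallel SmallVector recording insertion order) makes the cloned pointer_cast / bind_tile order reproducible.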
```cpp
if (!scope) {
  return op.emitOpError(
      "non-CV-scope op cannot be preserved by pto-cv-create-preload yet");
}
```
This check is overly restrictive. It causes the pass to fail if the loop body contains any operation not wrapped in a pto.cv.scope, including pure operations like arith.constant or index arithmetic. The design document (Section 7.8) explicitly mentions that constants and index arithmetic should be allowed and cloned for each stage.
Restricting the loop body to only CVScopeOp makes the pass fragile and forces users to wrap trivial setup code into scopes unnecessarily.
```cpp
if (action.isTPop) {
  if (pending.active)
    includeMissingThrough(pending, op.getPrevNode());
  collectScopeIfValid(pending, core, scopes);
  pending.input = action.input;
  includeRange(pending, &op, &op);
  segmentStart = &op;
  continue;
```
In collectScopesInFor, when the first TPipe operation encountered is a tpop, any operations preceding it in the loop body (e.g., induction variable adjustments or constant definitions) are not included in any scope. Since CVCreatePreloadPass requires all operations in the loop body to be wrapped in scopes, this gap will cause the expansion pass to fail on valid IR.
The logic should ensure that leading operations are either pulled into the first scope or that the identification process covers the entire block range.
```cpp
if (pending.active)
  includeMissingThrough(pending, terminator ? terminator->getPrevNode()
                                            : nullptr);
collectScopeIfValid(pending, core, scopes);
```
Similar to the issue with leading operations, if the loop body contains non-TPipe operations after the last identified scope has been closed (i.e., pending.active is false), these trailing operations are missed and left unwrapped. This will lead to expansion failures in the subsequent pto-cv-create-preload pass.
```cpp
if (!parentLoop1 || !parentLoop2) {
  return false;
}
```
In PipeConflictInSameLoop, returning false when both parentLoop1 and parentLoop2 are nullptr allows top-level buffers (defined at function scope) to reuse the same memory offsets even if they have conflicting pipe usage and overlapping lifetimes.
While SPEC_LEVEL_2 is intended to be less restrictive than SPEC_LEVEL_3, top-level buffers are effectively in the same execution scope. Allowing them to alias while being used by conflicting hardware units (e.g., MTE and Vector) simultaneously could lead to hardware hazards. It would be safer to check PipeConflict if both are nullptr.
Codex Review (this comment is automatically updated by the review bot).
Summary

PR #620 introduces two P1 correctness regressions in the new CV multi-buffer/preload pipeline and one P2 error-handling bug in PlanMemory.

Findings
P0 fixes (correctness, untested edges):
* CVCreatePreload: use llvm::MapVector for LocalBufferBinding so the cloned
per-stage pointer_cast / bind_tile order is deterministic across builds
(DenseMap iteration order had been leaking pointer-hash into the IR).
* Multi-buffer slot index now computes ((iv - lb) / step) mod N instead of
iv mod N. The previous form silently dropped slots when gcd(step, N) > 1
(e.g. step=2/N=4 only ever selected {0,2}). Fixed in three codegen paths
(PTOEnableMultiBuffer, GSS SyncSolverCodeGen, InsertSync SyncCodegen);
step=1, lb=0 still takes the single-remui fast path.
* GraphSyncSolver multi-buffer codegen now preserves the original sync
anchors. cp->op1 / cp->op2 are syncIR-ordered, not MLIR-lexical-ordered;
resolve via Operation::isBeforeInBlock so wait_flag_dyn lands before the
iteration's first user and set_flag_dyn after the iteration's last user.
Anchors in different blocks (e.g. nested scf.if) fall back to the prior
body-start / before-terminator placement.
P1 fixes (semantic gaps):
* SyncSolver::getMultiBufferEventIdInfo now matches HIVM and PTO InsertSync:
any dep pair with n<2 collapses the whole pair to single-buffer rather
than being silently skipped.
* CVMarkPreloadScopes: ScopeClass key now includes parent scf.for so two
unrelated loops sharing the same (core, input, output) pipe stay in
separate classes; fan-out / fan-in along a logical pipe is detected and
emits a diagnostic instead of silently dropping siblings.
* CVMarkPreloadScopes: TPipe ops nested under scf.if / scf.for inside the
  scope-recognition loop now emit a warning and skip auto-marking, instead
  of being missed by the direct-children walk.
* CVCreatePreload: stage-rotated multi-address pointer_casts are hoisted
above the rotation loop instead of being duplicated max*N times in the
body. PTOEnableMultiBuffer learns to infer the rotation loop from a
loop-invariant cast's users so the lowering still finds it.
* CVMarkPreloadScopes::canWrapNoResultScope walks every op (including ops
inside nested regions of the moved range) when checking for SSA escapes,
not just the top-level moved ops.
Tests:
* New regression test/lit/pto/multi_buffer_step_not_one.pto covers the
step!=1 slot-index path.
* Existing cv_create_preload_scope_pipeline.pto updated for the hoisted
stage pointer_cast layout.
* Full lit suite (194 tests) passes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
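The slot-index fix above is small enough to demonstrate numerically. A model of the emitted index arithmetic, not the pass itself:

```cpp
#include <cassert>
#include <cstdint>

// Normalized slot index: trip count since the lower bound, mod N.
// The buggy form was plain `iv % n`, which drops slots when gcd(step, N) > 1.
static int64_t slotIndex(int64_t iv, int64_t lb, int64_t step, int64_t n) {
  return ((iv - lb) / step) % n;
}
```

With step=2, N=4, the buggy `iv % N` visits only {0, 2} over iv = 0, 2, 4, ..., while `((iv - lb) / step) % N` rotates through all four slots; with step=1 and lb=0 the expression reduces to the original single-remui fast path.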
No description provided.